Fine-Grained Direct Preference Alignment Framework for Generative Visual Perception
XIE Tao1, YUAN Yuxuan1, ZUO Wangmeng2, LI Ruifeng1, ZHAO Lijun1
1. School of Mechatronics Engineering, Harbin Institute of Technology, Harbin 150001; 2. Faculty of Computing, Harbin Institute of Technology, Harbin 150001
Abstract: Generative referring segmentation methods built on multimodal large language models (MLLMs) are constrained by the supervised fine-tuning paradigm and leave ways of improving generation quality largely unexplored. Consequently, they suffer from semantic localization bias and coarse mask boundaries in complex scenes. To address these issues, a fine-grained direct preference alignment framework for generative visual perception (FG-DPA) is proposed. The direct preference optimization (DPO) algorithm is transferred from text understanding to the pixel-level segmentation task: preference pairs of high-quality and low-quality masks are constructed to steer the method toward more accurate visual representations in the latent space. Two types of negative samples are produced by exploiting the interactive nature of the segment anything model (SAM). To counter imprecise edges, adversarial point prompts are introduced inside the ground-truth bounding box, yielding low-quality masks with local omissions or overflows as boundary-level negatives. To counter incorrect target localization, non-overlapping masks are randomly sampled from the background region as semantic-level negatives. Through training with these samples, accurate segmentation is finally achieved in conjunction with SAM. Experiments on multiple public datasets show that FG-DPA effectively suppresses localization hallucination and significantly improves the completeness and edge accuracy of mask generation, validating its effectiveness in enhancing multimodal generative visual perception performance.
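The two core ingredients summarized above can be illustrated in a minimal sketch. This is not the paper's implementation: the function names, the point-sampling policy, and the per-pair scalar log-likelihoods are illustrative assumptions; the actual SAM prompting call and the token-level likelihoods of the mask sequence are omitted. It shows (a) sampling adversarial point prompts that lie inside the ground-truth box but off the ground-truth mask, so that an interactive segmenter such as SAM returns a flawed mask usable as the rejected half of a preference pair, and (b) the standard DPO loss applied to one such (preferred, rejected) pair.

```python
import math
import numpy as np

def adversarial_points(gt_mask, bbox, n_points=3, rng=None):
    """Sample point prompts inside the ground-truth box but OFF the
    ground-truth mask. Prompting an interactive segmenter (e.g. SAM)
    with such points tends to produce masks with local omissions or
    overflows, i.e. boundary-level negative examples.

    gt_mask: (H, W) boolean array; bbox: (x0, y0, x1, y1) in pixels.
    Returns an (n, 2) array of (x, y) coordinates.
    """
    rng = np.random.default_rng() if rng is None else rng
    x0, y0, x1, y1 = bbox
    # Background pixels restricted to the ground-truth box.
    ys, xs = np.nonzero(~gt_mask[y0:y1, x0:x1])
    idx = rng.choice(len(xs), size=min(n_points, len(xs)), replace=False)
    return np.stack([xs[idx] + x0, ys[idx] + y0], axis=1)

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (preferred, rejected) mask pair:
    -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))),
    where ref_* are log-likelihoods under the frozen reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))  # stable form of -log(sigmoid)
```

With a zero margin the loss equals log 2, and it decreases monotonically as the policy prefers the high-quality mask more strongly than the reference model does, which is what drives the alignment.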